Knowledge-Based Search in Collections of Digitized Manuscripts: First Results
Abstract
The paper briefly presents a methodology for developing tools for knowledge-based search in repositories of digitized manuscripts. It is designed to assist search in collections that may include XML documents, which are expected to be catalogue descriptions or marked-up full texts of mediaeval manuscripts. The methodology targets software environments able to handle relatively complex user queries containing words or phrases that are treated as domain concepts. It emphasizes two main types of activity: developing ontologies that describe the conceptual knowledge relevant to the chosen domain(s), and developing intelligent agents for search and processing that retrieve and filter documents by their semantic properties. Some considerations related to implementing the methodology are also presented. The first version of a software tool for knowledge-based search in repositories of digitized manuscripts is discussed, and some results of applying the tool to a collection of approximately 800 descriptions of mediaeval Bulgarian manuscripts stored in Bulgaria are analyzed.
Similar resources
Tools for Intelligent Search in Collections of Digitized Manuscripts
The paper presents a methodology for developing tools for semantics-oriented search in repositories of digitized manuscripts. This methodology is based on the application of some Semantic Web technologies to existing manuscript collections that may include an electronic catalogue containing marked-up manuscript descriptions, full texts of manuscripts, and digital images of manuscript pages. It is...
Feature Extraction in Segmented Words for Semi-automatic Transcription of Handwritten Arabic Documents
Scanning is a widely used solution for the preservation of ancient manuscripts. However, it produces masses of document images whose content is not easily exploitable. In this work, we propose a new method that considerably reduces manual transcription. The aim is to explore the content of digitized manuscripts. The proposed method is based on two main phases: the first one consists...
Finding information in books: characteristics of full-text searches in a collection of 10 million books
Searching large collections of digitized books is a relatively new area in information-seeking and retrieval research, made possible by initiatives such as Google Books and the HathiTrust Digital Library. The availability of large full-text book collections is transforming how users search and interact with information in books, but the characteristics of these changes are unknown. This paper a...
Visualizing Document Image Collections Using Image-Based Word Clouds
In this paper, we introduce image-based word clouds as a novel tool for quick and aesthetic overviews of common words in collections of digitized text manuscripts. While OCR can be used to enable summaries and search functionality for printed modern text, historical and handwritten documents remain a challenge. By segmenting and counting word images, without applying manual transcription or O...
Experiments on Large Scale Document Visualization using Image-based Word Clouds
In this paper, we introduce image-based word clouds as a novel tool for quick and aesthetic overviews of common words in collections of digitized text manuscripts. While OCR can be used to enable summaries and search functionality for printed modern text, historical and handwritten documents remain a challenge. By segmenting and counting word images, without applying manual transcription or O...